Abstract

We propose a strategy for disclosure risk evaluation and disclosure control of 
a microdata set based on fitting decomposable models of a multiway contingency 
table corresponding to the microdata set. By fitting decomposable models, we 
can evaluate per-record identification (or re-identification) risk of a microdata set. 
Furthermore, we can easily determine whether risky records can be swapped without
disturbing the set of marginals of the decomposable model. The use of decomposable
models has already been considered in the existing literature. The contribution of
this paper is a systematic strategy for finding a model with a good fit, identifying
risky records under the model, and then applying the swapping procedure to these
records.

1 Introduction 

In this paper we propose a systematic strategy for per-record identification risk
evaluation and disclosure control of risky records of a microdata set by fitting
decomposable models to the multiway contingency table corresponding to the microdata.
The first stage of our
strategy consists of selecting decomposable models with a good fit to the data based on 
Akaike's information criterion (AIC). Since the number of decomposable models is large, 
we propose an algorithm to find locally optimum decomposable models. The second stage 
is to evaluate cell probabilities of sample unique records and to estimate the number of 
population uniques in the microdata set based on the chosen model. The third stage 
consists of disclosure control of risky records by swapping. We consider swapping which 
does not disturb the set of marginals corresponding to the chosen model. 

In evaluating the disclosure risk of a given microdata set, the number of population
uniques among the sample unique records has been considered to be an important
overall measure of the disclosure risk. Starting from the Poisson-Gamma model (jl]), various
models of random partitions have been proposed for estimating the number of population
uniques. See a series of works of Hoshino ( IH, El, [HI; [HI) and references therein. These
models treat the sample unique records exchangeably, and hence the estimated conditional
probability of population uniqueness is common to every sample unique record. However,
some sample unique records are clearly more likely to be population uniques than
other records, according to the "rareness" of the records. If a sample unique has outlying
observations or a very rare combination of observed characteristics, it is likely to be a
population unique. A simple descriptive method for evaluating per-record identification
risk is to look at the minimum unsafe combination of variables for a sample unique record
(0).

A more systematic way of evaluating the per-record identification risk is to model the cell
probabilities of the contingency table corresponding to a microdata set, where all the key
variables of the microdata set are categorized and the joint frequencies of the key variables
are counted. If the estimated cell probability of a sample unique cell is very small, then
the sample unique is rare and risky. This approach was investigated in [1J| , |8j , [3j . These
works used the standard log-linear models for cell probabilities of contingency tables.

In actual evaluation of disclosure risk, we often have to consider 10 or more possible
key variables. Then the contingency table is large and sparse, and the estimation of cell
probabilities of standard log-linear models is not straightforward, except for decomposable
models. In Section 2.2 we consider an example of an 8-way contingency table from the 1990
U.S. Census of Population and Housing data. From the viewpoint of disclosure control
this example is of moderate size, but the contingency table corresponding to the microdata
has more than 12 million cells.

Because of the computational difficulty, Takemura jlH considered Lancaster-type additive
modeling of cell probabilities. However, in fitting additive models the estimated cell
probabilities often become negative, especially for empty cells. In this sense additive
models are not satisfactory for estimating small cell probabilities, although they are useful
for the purpose of relative evaluation of identification risks of sample unique cells.

Among the log-linear models, decomposable models are special in the sense that the
maximum likelihood estimates of the cell probabilities can be explicitly written as ratios
of products of marginal frequencies. Unlike other log-linear models, in a decomposable
model the cell probability of each cell can be estimated separately. This is a very attractive
feature of decomposable models, because we are mainly interested in sample unique cells or
other cells of small frequency. Furthermore, model selection among decomposable models
is relatively easy, because the maximized log likelihood and the degrees of freedom can be
simply evaluated. For fitting other log-linear models, we need some iterative procedure
such as iterative proportional scaling (see e.g. 1). For large contingency tables iterative
proportional scaling is computationally very intensive, because cell probability estimates
of all the cells have to be stored in some form and updated in each iteration.

Estimation and diagnostics of a particular decomposable model are easy. However, if
the number m of key variables is large, there are many possible decomposable models. As
shown in Table 1 below, for our example of m = 8 key variables there are more than 30 million
possible decomposable models. Finding the best fitting model among more than 30 million
possible models is impractical. We propose to find several locally optimum models and
choose one of these models.

Once a decomposable model with a good fit is obtained, we look at sample unique
cells with very small estimated cell probabilities. If the cells are considered to be risky, it
is desirable to apply some disclosure control measure to these cells. From the viewpoint
of log-linear models, it is natural to consider swapping these risky records in such a way
that the swapping does not disturb the given set of marginals corresponding to the cliques
of the decomposable model. This is based on the fact that the set of marginals constitutes
the sufficient statistic of the model, so that swapping does not influence statistical inferences
based on the model. Using the results of [18] we show that it is straightforward to
determine whether a particular record is swappable and to find another record for swapping
if swapping is possible.

The organization of the paper is as follows. In Section 2 we summarize preliminary
material and introduce our working example. In Section 3 we discuss fitting and selection
of decomposable models. In Section 4, based on a chosen decomposable model, we evaluate
per-record identification risk. In Section 5 we perform swapping of risky records. Section
6 ends the paper with some concluding remarks.

2 Preliminaries and a working example 

In this section we prepare notations on decomposable models and describe a working 
example analyzed in this paper. 

2.1 Notations on decomposable models 

We follow the notation of [13]. Let $A = \{1, \dots, m\}$ denote the set of the key variables.
Each variable is denoted by $\delta \in A$. We assume that all key variables are already discretized
and let $\mathcal{I}_\delta = \{1, \dots, I_\delta\}$ denote the set of categories of $\delta$. Each cell is indexed by m indices
$i = (i_1, \dots, i_m)$ and the set of the cells is the direct product $\mathcal{I} = \prod_{\delta \in A} \mathcal{I}_\delta$. The frequency
of cell $i$ is denoted by $n(i)$.

Let $a \subset A$ be a subset of variables. Then an $a$-marginal cell $i_a$ of $i = (i_1, \dots, i_m)$
is defined as $i_a = (i_\delta)_{\delta \in a}$. The set of $a$-marginal cells is $\mathcal{I}_a = \prod_{\delta \in a} \mathcal{I}_\delta$. The marginal
frequency of the $a$-marginal cell $i_a$ is written as

\[ n(i_a) = \sum_{j :\, j_a = i_a} n(j), \]

where $j_a = i_a$ means $i_k = j_k$ for all $k \in a$. Let $n = \sum_{i \in \mathcal{I}} n(i)$ denote the sample size (number
of records) of the microdata set. We denote the relative frequency of a cell $i$ and a
marginal cell $i_a$ by

\[ r(i) = \frac{n(i)}{n}, \qquad r(i_a) = \frac{n(i_a)}{n}. \]

We use the same notation for cell probabilities $p(i)$, $p(i_a)$, etc.

Consider a graph $G = (A, E)$ with the set of vertices $A$ and the set of edges $E$. Let $\mathcal{C}$
denote the set of (maximal) cliques. For a subset $a \subset A$ let $\mu_a : \mathcal{I} \to \mathbb{R}$ denote a function
of $i$ which only depends on the marginal cell $i_a$, i.e. $\mu_a(i) = \mu_a(i_a)$. Then the graphical
model associated with $G$ specifies the cell probability $p(i)$ as

\[ \log p(i) = \sum_{a \in \mathcal{C}} \mu_a(i_a). \tag{1} \]

A graph $G$ is chordal (decomposable, triangulated) if every cycle of length $l \ge 4$
has a chord. A graphical model with a chordal $G$ is called a decomposable model. For
a decomposable model, the cliques can be ordered to satisfy the running intersection
property:

(RIP) For each $2 \le j \le m$, there exists $1 \le k \le j - 1$ such that

\[ S_j = C_j \cap (C_1 \cup C_2 \cup \dots \cup C_{j-1}) \subset C_k. \]

An ordering $(C_1, \dots, C_m)$ satisfying RIP is called a perfect sequence. $S_2, \dots, S_m$ are
minimal vertex separators of $G$. The number of times a minimal vertex separator $S$ appears
in any perfect sequence is the same and is called the multiplicity of $S$. We denote the
multiplicity of $S$ by $\nu(S)$. $\mathcal{S}$ denotes the set of minimal vertex separators. In the following
we simply say "separator" to mean a minimal vertex separator.
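The running intersection property can be checked mechanically for a given clique ordering. The following is a minimal sketch, not part of the original procedure: the function name `separators_from_sequence` is hypothetical, and the clique ordering used for Model 1 of Table 2 is an assumption made here for illustration.

```python
from collections import Counter

def separators_from_sequence(cliques):
    """Return the separators S_2, ..., S_m of an ordered clique list.

    Raises ValueError if the ordering violates the running intersection property.
    """
    seps = []
    seen = set(cliques[0])
    for j in range(1, len(cliques)):
        s = set(cliques[j]) & seen
        # RIP: the intersection with all earlier cliques must lie in one earlier clique
        if not any(s <= set(c) for c in cliques[:j]):
            raise ValueError(f"RIP fails at clique {cliques[j]}")
        seps.append(frozenset(s))
        seen |= set(cliques[j])
    return seps

# Cliques reported for Model 1 in Table 2; this particular ordering is assumed here
model1 = [(1, 2, 6), (1, 6, 7), (2, 6, 8), (3, 6, 7), (4, 6), (5, 6)]
multiplicity = Counter(separators_from_sequence(model1))   # nu(S) for each separator
```

Counting repeated separators in the returned list gives the multiplicity $\nu(S)$; for this ordering the separator {6} appears twice.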

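The maximum likelihood estimate (MLE) of a decomposable model is explicitly written
as

\[ \hat p_{MLE}(i) = \begin{cases}
\dfrac{\prod_{C \in \mathcal{C}} r(i_C)}{\prod_{S \in \mathcal{S}} r(i_S)^{\nu(S)}}, & \text{if } r(i_C) > 0,\ \forall C \in \mathcal{C}, \\
0, & \text{otherwise.}
\end{cases} \tag{2} \]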
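Because (2) involves only clique and separator marginals, the estimate for a single cell can be computed without building the full table. The sketch below illustrates this under stated assumptions: the function name, the 0-based variable positions, and the convention that a separator is listed once per unit of multiplicity are all choices made here, and the linear scan over records is for clarity rather than speed.

```python
def mle_cell_probability(cell, records, cliques, separators):
    """Closed-form MLE (2) for a single cell.

    cell, records: tuples of category codes, one entry per variable.
    cliques: tuples of 0-based variable positions.
    separators: tuples of positions, each repeated according to its multiplicity.
    """
    n = len(records)

    def rel_freq(subset):
        # relative frequency r(i_a) of the marginal cell of `cell` on `subset`
        target = tuple(cell[v] for v in subset)
        return sum(tuple(r[v] for v in subset) == target for r in records) / n

    numerator = 1.0
    for c in cliques:
        r = rel_freq(c)
        if r == 0:
            return 0.0          # some clique marginal is empty
        numerator *= r
    denominator = 1.0
    for s in separators:
        denominator *= rel_freq(s)   # positive whenever all clique marginals are
    return numerator / denominator

# Toy data (not the census example): 3 variables, cliques {0,1} and {1,2}
toy = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 1)]
p = mle_cell_probability((0, 0, 0), toy, [(0, 1), (1, 2)], [(1,)])
```

In practice the marginal counts would be tabulated once per clique and separator rather than recomputed for every cell.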

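The degrees of freedom are also simply written (Proposition 4.35 of [13]):

\[ \mathrm{df} = \sum_{C \in \mathcal{C}} \prod_{\delta \in C} I_\delta - \sum_{S \in \mathcal{S}} \nu(S) \prod_{\delta \in S} I_\delta - 1. \tag{3} \]

Hence AIC for model selection is also easily computed:

\[ \mathrm{AIC} = -2 \times (\text{log likelihood}) + 2 \times (\text{degrees of freedom}). \tag{4} \]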
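As a check on the degrees-of-freedom count, the following sketch evaluates it for the two models reported in Table 2, using the category counts of the working example in Section 2.2. The function names are hypothetical, and the convention of subtracting 1 for the normalization constraint is the one that reproduces the degrees of freedom listed in Table 2.

```python
from math import prod

# Category counts of the eight key variables of the working example (Section 2.2)
LEVELS = {1: 14, 2: 2, 3: 91, 4: 5, 5: 14, 6: 7, 7: 2, 8: 5}

def degrees_of_freedom(cliques, separators, levels=LEVELS):
    """Clique table sizes minus separator table sizes (with multiplicity), minus 1."""
    clique_cells = sum(prod(levels[v] for v in c) for c in cliques)
    sep_cells = sum(prod(levels[v] for v in s) for s in separators)
    return clique_cells - sep_cells - 1

def half_aic(log_likelihood, df):
    """AIC / 2 = -(log likelihood) + (degrees of freedom)."""
    return -log_likelihood + df

# Cliques and separators of Model 1 in Table 2 (separator {6} listed twice)
model1_cliques = [(1, 2, 6), (1, 6, 7), (2, 6, 8), (3, 6, 7), (4, 6), (5, 6)]
model1_seps = [(1, 6), (2, 6), (6, 7), (6,), (6,)]
df1 = degrees_of_freedom(model1_cliques, model1_seps)   # 1728, as in Table 2
```

With the log likelihood of Model 1 from Table 2, `half_aic(-12141.07, df1)` agrees with the tabulated AIC/2 of 13869.07.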

In Table 1 we list the number of graphical models and the number of decomposable
models for m-way contingency tables up to m = 8. We see that the number of decomposable
models increases very fast with m. The number in parentheses for the
decomposable models indicates the number of chordal graphs with m vertices after
identification of isomorphic graphs, i.e., we do not distinguish graphs which can be obtained by
relabeling of vertices. Based on [4] we provide a list of non-isomorphic chordal graphs for
m ≤ 8 in [5]. Given a list of non-isomorphic chordal graphs we can pick a decomposable
model by choosing a graph from the list and arbitrarily assigning a variable to each vertex
of the graph.

Table 1: Number of graphical models and decomposable models

    m    graphical     decomposable
    2            2            2 (2)
    3            8            8 (4)
    4           64           61 (10)
    5         1024          820 (27)
    6        32768        18154 (96)
    7      2097152       617675 (469)
    8    268435456     30888596 (3734)


2.2 A working example 

In this paper we apply our strategy to a test data set from the 1990 U.S. Census of Population
and Housing Public Use Microdata Samples (PUMS). We subsampled n = 9809 individuals from
the state of Washington and chose m = 8 variables for our experiment.



1. Relationship (14 categories)
2. Sex (2 categories)
3. Age (91 categories)
4. Marital status (5 categories)
5. Place of birth (14 categories)
6. Spouse present/absent (7 categories)
7. Own child (2 categories)
8. Age of own child (5 categories)



The population size of the state of Washington is about N = 4,867,000. The dataset can
be viewed as an 8-way contingency table of the type

14 × 2 × 91 × 5 × 14 × 7 × 2 × 5

with approximately 12.5 million cells (more exactly 12,485,200 cells). We see that the
contingency table is very sparse, with only n = 9809 counts in 12.5 million cells. We took
these m = 8 variables from a PUMS data set without further global recoding. For example,
we used the age itself with 91 categories. This is somewhat unrealistic for evaluation of
disclosure risk. On the other hand, there are other possible key variables in the original
PUMS data set.

It should be noted that although the (formal) total number of cells, 12,485,200, is very
large, the effective total number should be much smaller because of structural zeros. For
example, there is no age of own child if there is no own child. In this case the age of own
child is coded as N/A in the original data set. Also there is an obvious relation between
age and marital status. In this paper we ignore the effect of structural zeros. See Section
6 for more discussion.

For reference we show the first few lines of the 9809 × 8 data matrix.

00,0,17,4,10,6,0,0
00,0,17,4,52,6,0,0
00,0,18,0,23,1,0,0
00,0,18,0,24,1,0,0
00,0,18,0,51,1,0,0

The frequencies of the cell sizes (size indices, frequency of frequencies) of this data set
are given as follows. The table shows that there are 2243 cells of frequency 1, 524 cells of
frequency 2, etc.

    Cell size    1     2    3    4    5    6   7   8   9   10  ≥11
    Frequency    2243  524  275  132  104  60  59  34  46  19  124
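Size indices of this kind are straightforward to tabulate from the data matrix. A minimal sketch on a tiny illustrative input (not the census data), with a hypothetical function name:

```python
from collections import Counter

def size_indices(records):
    """Map each cell size k to the number of cells observed exactly k times."""
    cell_counts = Counter(map(tuple, records))    # n(i) for every non-empty cell
    return Counter(cell_counts.values())          # frequency of frequencies

# Tiny illustrative input: each row is one record of two key variables
toy = [(0, 1), (0, 1), (1, 1), (2, 0), (2, 0), (2, 0), (3, 3)]
sizes = size_indices(toy)
```

Here `sizes[1]` is the number of sample unique cells, and the sum of `k * sizes[k]` over all k recovers the sample size.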



We are interested in estimating the number of population uniques among the 2243 sample
uniques and in evaluating which sample records are particularly risky. As a preliminary
analysis, we fitted the Ewens model, the Pitman model and the Lancaster-type additive model.
The estimates of the number of population uniques under these models are as follows.

Ewens model: 5.9, Pitman model: 214.0, additive model: 252.1.

3 Selection of decomposable models 

The first step of our strategy is to choose a decomposable model which fits the data.
As shown in Table 1, the number of possible decomposable models grows very fast as
the number of variables m increases. For m ≤ 8 we can use the list of non-isomorphic
chordal graphs available at [5]. We present the following Algorithm 1 to obtain a locally
best decomposable model in terms of AIC. Application of Algorithm 1 to the data set of
our working example is summarized in Table 2 below.

In our algorithm we add or subtract an edge to (or from) a chordal graph to move to 
another chordal graph and evaluate AIC. It outputs a model with locally minimum AIC. 
We can apply our algorithm from various initial models and compare these locally best 
models to obtain approximately a globally best model. 

The notation of Algorithm 1 is as follows. $G = G(V, E) = G(V, E_G)$ is a graph with the
set of vertices $V$ and the set of edges $E$. $M(G)$ denotes the graphical model associated
with $G$. $E(C_m)$ denotes the set of edges of the complete graph with $m$ vertices.

In Step 1 we choose an initial model randomly from the list of non-isomorphic
decomposable models ([4], [5]). Then we randomly label the vertices to obtain a decomposable
model. We will discuss random generation of initial models for m > 8 in Algorithm 2
below.

In Step 2 we choose the candidate for the next decomposable model. We add or delete
an edge and determine whether the resulting graph is chordal. If it is chordal we evaluate
its AIC. For evaluating AIC we need to obtain the set of cliques and the set of separators.
Chordality of a graph is determined by obtaining a perfect elimination scheme, and the set
of cliques and the separators are obtained by the "maximum cardinality search" algorithm.



Algorithm 1 Model selection of decomposable models.

Input: Microdata D, list 𝒞_m of non-isomorphic chordal graphs with m vertices

Output: Model M with locally minimum AIC

1. Choose a chordal graph H ∈ 𝒞_m at random;
   Label the vertices of H at random and obtain a chordal graph G_next;
   A_next ← AIC of M(G_next);
   f ← true;

2. while f = true do
       f ← false;
       G ← G_next;
       for each e ∈ E(C_m) do
           if e ∈ E_G then
               G' ← G(V, E_G \ {e})
           else
               G' ← G(V, E_G ∪ {e});
           if G' is chordal then
               A' ← AIC of M(G');
               if A' < A_next then
                   G_next ← G';
                   A_next ← A';
                   f ← true;

3. Output M(G);
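The edge-flip search of Algorithm 1 can be sketched as follows. This is an illustration under assumptions, not the implementation used for the results in this paper: `score` is a hypothetical callable standing in for the AIC of M(G), the initial graph is passed in directly instead of being drawn from the list of chordal graphs, and chordality is tested by repeatedly eliminating a simplicial vertex rather than by maximum cardinality search.

```python
from itertools import combinations

def is_chordal(vertices, edges):
    """Chordality test: a graph is chordal iff simplicial vertices can be removed one by one."""
    adj = {v: set() for v in vertices}
    for e in edges:
        u, w = tuple(e)
        adj[u].add(w)
        adj[w].add(u)
    remaining = set(vertices)
    while remaining:
        simplicial = next(
            (v for v in remaining
             if all(b in adj[a] for a, b in combinations(adj[v] & remaining, 2))),
            None,
        )
        if simplicial is None:
            return False          # no simplicial vertex left, so a chordless cycle exists
        remaining.remove(simplicial)
    return True

def local_search(vertices, edges, score):
    """Move to the best chordal neighbour (one edge added or deleted) until none improves."""
    best = frozenset(frozenset(e) for e in edges)
    best_score = score(best)
    improved = True
    while improved:
        improved = False
        current = best
        for pair in combinations(vertices, 2):
            candidate = current ^ {frozenset(pair)}   # delete the edge if present, else add it
            if is_chordal(vertices, candidate):
                s = score(candidate)
                if s < best_score:
                    best, best_score, improved = candidate, s, True
    return best, best_score
```

Running `local_search` from several random chordal starting graphs and keeping the result with the smallest score mirrors the restart strategy described above.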

For m > 8 we propose the following algorithm to generate an initial decomposable
model to replace Step 1 of Algorithm 1. Given a chordal graph G with m vertices,
we can obtain a chordal graph G' with m + 1 vertices by adding the (m + 1)-st vertex
and connecting it to a subset of one clique C of G. Since a chordal graph possesses a
perfect sequence of cliques, the above recursive procedure generates all chordal graphs.
The following Algorithm 2 outputs the set of cliques of a random chordal graph. Note
that the probability distribution of the random choices in the algorithm is not specified and
the distribution of the output is not necessarily the uniform distribution over the set of
chordal graphs with m vertices.

Algorithm 2 "Random" chordal graph with m vertices.

Input: m

Output: Set of cliques of a random chordal graph with m vertices.

1. Initialize 𝒞 = ∅;

2. for j ← 1 until m do
       Flip a coin;
       if heads then
           𝒞 ← 𝒞 ∪ {{j}}
       else choose a member C ∈ 𝒞 and a subset C' ⊂ C at random;
           if C' = C then
               C ← C ∪ {j}
           else
               𝒞 ← 𝒞 ∪ {C' ∪ {j}};

3. Output 𝒞;
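A possible realization of Algorithm 2 is sketched below. Since the algorithm leaves the random choices unspecified, the following are assumptions made only for this sketch: a fair coin, a uniformly chosen clique, a non-empty subset whose size is drawn uniformly, and a forced "heads" while no clique exists yet (so that the first vertex always starts a clique).

```python
import random

def random_chordal_cliques(m, rng=None):
    """Return the maximal cliques of a randomly grown chordal graph on vertices 1..m."""
    rng = rng or random.Random()
    cliques = []
    for j in range(1, m + 1):
        if not cliques or rng.random() < 0.5:
            cliques.append({j})                   # heads: j becomes a new singleton clique
            continue
        idx = rng.randrange(len(cliques))         # tails: pick an existing clique C
        c = cliques[idx]
        k = rng.randint(1, len(c))                # size of the non-empty subset C'
        sub = set(rng.sample(sorted(c), k))
        if sub == c:
            cliques[idx] = c | {j}                # C' = C: enlarge the clique itself
        else:
            cliques.append(sub | {j})             # otherwise add the new clique C' ∪ {j}
    return [frozenset(c) for c in cliques]

cliques = random_chordal_cliques(8, random.Random(0))
```

Each new clique contains the newly added vertex, so no clique in the output is contained in another and the list is a valid set of maximal cliques.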



4 Per-record identification risk and estimate of the number of population uniques

Once a well-fitting decomposable model is chosen, we can estimate the cell probability of
a sample unique cell by the MLE (2). Then a natural estimate of the conditional probability
that the sample unique cell $i$ is also a population unique is given as

\[ (1 - \hat p_{MLE}(i))^{N-n}, \tag{5} \]

where N is the population size and n is the sample size. (5) is the estimated probability
that none of the remaining N - n individuals in the population fall into cell $i$, under the
assumption that individuals fall into cells independently of each other according to the
estimated probability distribution. The number of population uniques in the sample can
be estimated as

\[ \sum_{i :\, \text{sample unique}} (1 - \hat p_{MLE}(i))^{N-n}. \]
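For very small cell probabilities and a large exponent N - n, it is convenient to evaluate (5) on the log scale. The sketch below uses the population and sample sizes of the working example; the function names are hypothetical and the input probability is an arbitrary illustrative value, not an estimate from the data.

```python
import math

def prob_population_unique(p_hat, N, n):
    """(1 - p_hat)^(N - n), evaluated via log1p for numerical stability (requires p_hat < 1)."""
    return math.exp((N - n) * math.log1p(-p_hat))

def expected_population_uniques(p_hats, N, n):
    """Sum of (5) over the estimated probabilities of the sample unique cells."""
    return sum(prob_population_unique(p, N, n) for p in p_hats)

N, n = 4_867_000, 9809                        # population and sample size of the working example
risk = prob_population_unique(1e-8, N, n)     # illustrative cell probability of 10^-8
```

A cell probability of $10^{-8}$ gives a value of roughly 0.95, whereas a probability of $10^{-4}$ gives a value that is numerically negligible, which is why only the cells with the smallest estimated probabilities contribute appreciably to the sum.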

In Table 2 we show the two models with the smallest values of AIC obtained by applying
Algorithm 1 100 times to our example. Algorithm 1 converged after a few transitions and it
seems to be very practical. These two models were also the ones most frequently obtained
from Algorithm 1. In both models, the separator {6} has multiplicity 2, as indicated by the
repetition in the table. The estimated numbers of population uniques (48.867, 40.515) are
between those of the Ewens model and the Pitman model and seem to be reasonable. Variable 6
(Spouse present/absent) is contained in many cliques, which can be explained by its high
correlation with other variables and yet small degrees of freedom. On the other hand,
variable 5 (Place of birth) is contained in a single clique (i.e. it is a simplicial vertex),
which is also reasonable.

Furthermore, the sample uniques with very small estimated cell probabilities
($\hat p(i) < 10^{-8}$) are common to these two models. We might consider some disclosure control
measure for the roughly 20 sample uniques with estimated cell probability less than $10^{-7}$.



5 Swappability of risky records 

In Table 2 two records have an estimated cell probability of less than $10^{-8}$. They probably
need some disclosure control. In this paper we propose to swap some observations of these
records with other records of the data set. Since we have found a decomposable model
with a good fit, it is desirable to swap the observations such that the marginal frequencies
for the cliques of the chosen model are not disturbed. In [18] we give some necessary and
sufficient conditions for swappability of a particular sample unique record with some other
record without disturbing a given set of marginals.

Table 2: Chosen models

                                         Model 1                        Model 2
    Number of times chosen               11                             7
    AIC/2                                13869.07                       13984.97
    log likelihood                       -12141.07                      -12013.97
    degrees of freedom                   1728                           1971
    estimated # of population uniques    48.867                         40.515
    cliques                              {1,2,6}, {1,6,7}, {2,6,8},     {1,6,7}, {3,6,7}, {1,6,8},
                                         {3,6,7}, {4,6}, {5,6}          {2,8}, {4,6}, {5,6}
    separators                           {1,6}, {2,6}, {6,7}, {6}, {6}  {1,6}, {6,7}, {6}, {6}, {8}

    cell probability estimates           frequencies                    frequencies
    10^-2 to 10^-3                       0                              0
    10^-3 to 10^-4                       352                            351
    10^-4 to 10^-5                       1092                           1117
    10^-5 to 10^-6                       599                            600
    10^-6 to 10^-7                       179                            158
    10^-7 to 10^-8                       19                             15
    10^-8 to 10^-9                       2                              2
    10^-9 to 10^-10                      0                              0

For a decomposable model, a simple method of searching for another record for swapping
can be described as follows. Let $i$ be a sample unique record such that we want to swap
some observations of this record with another record. Let $\mathcal{C}$ be the set of cliques of a
chosen model and let $\mathcal{S}$ denote the set of minimal vertex separators. Write each separator
$S$ as the intersection of two cliques, $S = C \cap C'$. We consider all triples $(C, C', S)$ such that
$S = C \cap C'$. For example, in Model 1 in Table 2 all possible ways of writing separators
are as follows.

    {1,6} = {1,2,6} ∩ {1,6,7},
    {2,6} = {1,2,6} ∩ {2,6,8},
    {6,7} = {1,6,7} ∩ {3,6,7},
    {6}   = {1,2,6} ∩ {3,6,7} = {1,2,6} ∩ {4,6} = {1,2,6} ∩ {5,6}
          = {1,6,7} ∩ {2,6,8} = {1,6,7} ∩ {4,6} = {1,6,7} ∩ {5,6}
          = {2,6,8} ∩ {3,6,7} = {2,6,8} ∩ {4,6} = {2,6,8} ∩ {5,6}
          = {3,6,7} ∩ {4,6} = {3,6,7} ∩ {5,6} = {4,6} ∩ {5,6}.



For a particular sample unique record $i$, we search other records $j \ne i$ such that for
some $(C, C', S)$ we have

\[ i_S = j_S, \qquad i_C \ne j_C, \qquad i_{C'} \ne j_{C'}. \tag{6} \]

If we find some $j$ and some $(C, C', S)$ such that (6) holds, then we can swap some
observations between $i$ and $j$.
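The search just described can be sketched in two steps: enumerate the triples $(C, C', S)$, then scan the data for records satisfying (6). The function names are hypothetical, variables are numbered from 1 as in Table 2, and the sketch only reports candidate partners; it does not carry out the swap itself.

```python
from itertools import combinations

def separator_triples(cliques, separators):
    """All (C, C', S) such that S = C ∩ C' is a separator of the model."""
    seps = {frozenset(s) for s in separators}
    triples = []
    for c1, c2 in combinations(cliques, 2):
        s = frozenset(c1) & frozenset(c2)
        if s in seps:
            triples.append((c1, c2, tuple(sorted(s))))
    return triples

def find_swap_partners(i, records, triples):
    """Yield (index of j, triple) for every record j satisfying condition (6)."""
    def marg(rec, subset):
        return tuple(rec[v - 1] for v in subset)      # variables are numbered from 1
    for idx, j in enumerate(records):
        for c1, c2, s in triples:
            if (marg(i, s) == marg(j, s)
                    and marg(i, c1) != marg(j, c1)
                    and marg(i, c2) != marg(j, c2)):
                yield idx, (c1, c2, s)

# Cliques and separators of Model 1 in Table 2
model1_cliques = [(1, 2, 6), (1, 6, 7), (2, 6, 8), (3, 6, 7), (4, 6), (5, 6)]
model1_seps = [(1, 6), (2, 6), (6, 7), (6,)]
triples = separator_triples(model1_cliques, model1_seps)   # 15 triples, as listed above
```

Because $i_C \ne j_C$ already excludes $j = i$, no separate check for identical records is needed.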

We applied this procedure to 50 sample unique records with small estimated cell
probabilities in Table 2. For both models of Table 2 this procedure quickly found other
records for swapping for most of the 50 records, including the two records with an estimated
cell probability of less than $10^{-8}$. Therefore this procedure seems to work very well in
practice.

Note that (6) is a sufficient condition for swappability between $i$ and $j$ for a
decomposable model. For a full statement of necessary and sufficient conditions for general
hierarchical models see Section 3 of [18].

6 Concluding remarks 

In this paper we proposed a systematic strategy for disclosure risk evaluation and
disclosure control of a microdata set by fitting decomposable models. We have restricted our
attention to decomposable models in view of computational convenience. Clearly it is
desirable to consider other hierarchical models, such as the model containing all two-factor
interaction terms. A simpler hierarchical model might give a better fit than a more
complicated decomposable model. One strategy we can try is to look for hierarchical models
which improve the fit around a locally best decomposable model.

We have used AIC for evaluating the fit of the model. Theoretically AIC is justified
for large sample sizes. In disclosure control problems we are dealing with large and sparse
tables, and from a theoretical viewpoint the use of AIC is not justified. However, in practice it
is simple and seems to work reasonably well. It is of interest to investigate other methods
of model selection for evaluating the fit of various models.

In microdata sets of official statistics, there are a large number of structural zeros due to
various logical relations between key variables. In principle we should list all the logical
relations and specify structural zeros before fitting a model. But this is very cumbersome.
Also the calculation of the degrees of freedom of a model becomes complicated. It is desirable
to develop some practical methods to deal with structural zeros in some automatic way.

If we want to swap some observations from a sample unique record i and if we can
find many other records j for swapping, it might be desirable to use a j which is close to
i in some sense. In jlij we considered swapping of observations between close records by
introducing an appropriate distance function between records.